Topology-based fault analysis in telecommunications networks

ABSTRACT

A method and apparatus for detecting traffic-affecting failures in a telecommunications network by: inferring the most probable location of each such failure, given multiple alarm indicators along a network circuit; correlating circuit alarms to trunk failures, or inferring trunk failures from circuit alarms; inferring the location of major network outages by topologically correlating multiple trunk failures; and filtering alarm reporting to the Fault Management System users such that only the most significant derived or inferred conditions are automatically displayed.

BACKGROUND OF THE INVENTION

Telecommunications equipments are designed to have some means of detecting and reporting traffic-affecting faults. Collecting and displaying these fault alarms is the responsibility of the network Fault Management System (FMS). The functional groups that are the primary users of the FMS are typically called Surveillance, which has responsibility for monitoring equipment faults and initiating repair actions, and Restoration, which has responsibility for rerouting network traffic around an outage.

The alarms generated by network equipments typically identify the affected equipment and the type of fault detected by that equipment. However, a single fault in a network can generate alarm reports throughout the network on any equipment that also transports any of the traffic affected by that fault. It is generally the case that knowledge of network topology (that is, the connections between equipments that define the traffic paths through the network) is not present at the equipment level. Therefore, correlations exist between fault alarm reports that are not immediately obvious without considering the alarms within the context of the network topology.

The following description of the present invention will use the term "circuit" to mean a data traffic carrier or pathway of some specific data capacity through a telecommunications network. Data can only be inserted or retrieved (usually both, since the traffic is two-way) from the end points of this circuit; all other equipments along the path relay the data toward the destination end point.

For efficiency of transmission, multiple circuits of the same capacity are often combined or "multiplexed" together into a single data carrier. This higher-capacity carrier will be called a "trunk", relative to the circuits that it carries. A circuit might be carried by a series of such higher-level trunks on its way to its destination. But each trunk is also a circuit: it provides a specific data-carrying capacity between source and destination end points, and it consists of a series of transmission equipment connections through the network. Trunks of the same capacity can also be multiplexed together to form even higher-level trunks.

The standard digital telecommunications multiplex hierarchy used in the United States consists of: DS-0 circuits (or Digital Signal Level 0) with a capacity of 64 kilobits per second (Kbps); DS-1 circuits of 1.544 megabits per second (Mbps) or 24 DS-0s; DS-2 circuits of 6.312 Mbps or 4 DS-1s; and DS-3 circuits of 44.736 Mbps or 7 DS-2s. Long-haul transmission equipment such as fiber-optic systems combine a certain number of DS-3s, the number being determined by the speed of the specific technology employed. An example would be Synchronous Optical Network (SONET) OC-48 (Optical Carrier Level 48) equipment, which combines 48 DS-3 circuits.
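The multiplexing ratios in this hierarchy can be checked with a short calculation. The following Python sketch is only illustrative; the names `RATIOS`, `ds0_capacity`, and `DS1_PAYLOAD_KBPS` are hypothetical and are not part of the invention.

```python
# Multiplexing ratios stated in the hierarchy above.
DS0_KBPS = 64
RATIOS = [("DS-1", 24), ("DS-2", 4), ("DS-3", 7), ("OC-48", 48)]


def ds0_capacity() -> dict:
    """Return the number of DS-0 channels carried at each multiplex level."""
    capacity = {"DS-0": 1}
    count = 1
    for level, factor in RATIOS:
        count *= factor          # each level combines `factor` carriers of the level below
        capacity[level] = count
    return capacity


CAPACITY = ds0_capacity()

# Payload carried by the DS-0 channels of a single DS-1, in Kbps.
DS1_PAYLOAD_KBPS = CAPACITY["DS-1"] * DS0_KBPS
```

Under these ratios a DS-3 carries 672 DS-0 channels and an OC-48 system carries 32,256. The DS-1 payload of 1536 Kbps is slightly below the 1.544 Mbps line rate because the line rate also includes framing overhead.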

Typically, when a failure occurs on a circuit, the equipment closest to the failure detects the fault ("loss of signal", for example), reports the fault, and propagates an alarm indicator signal in the "downstream" direction on the affected circuit. Alarms are therefore reported in the receive direction on each side of the fault to the far ends of the circuit. Furthermore, if that circuit is a trunk (carrying circuits of a lower capacity level) then the multiplexing equipments at the trunk ends also propagate alarm indicators downstream along those lower-level circuits. As a result, when a major outage occurs a large number of fault alarms are reported. Without considering network topology, it is difficult to determine how many faults there are and which alarms are significant for locating the faults.

Further complicating the situation is the fact that not all equipment connection points provide fault alarm information because of limitations in the equipment (especially older types) or because of limitations within the Fault Management System itself. Moreover, the fault reporting network and remote monitoring subsystems are also subject to failures, so there is always a possibility that some alarms may not be delivered to the Fault Management System.

These complications mean that manual alarm analysis by the FMS users is tedious and time-consuming. This invention is intended to augment the FMS alarm reporting by automating the process of analyzing the transmission equipment alarms in the context of network topology, thereby allowing a faster and more accurate response to a network traffic outage.

BRIEF DESCRIPTION OF THE PRESENT INVENTION

This invention is intended to detect, confirm, and locate major outages in a telecommunications network. The process implemented by this invention uses network circuit topology to correlate equipment alarms and provide the following results: reporting of the most-significant fault alarms; suppressing the reporting of sympathetic alarms downstream from a fault; inferring a trunk outage from circuit alarms, even if no fault has been reported on the trunk; confirming that a reported trunk fault is actually causing a traffic outage if the contained circuits are also in alarm; correlating transmission system trunk outages that share the same path (e.g., fiber optic pairs within the same cable); and making an accurate determination of the location of any faults.

Fault alarm data are collected from network multiplexer and transmission equipments. Each alarm represents a specific fault detected on a particular piece of equipment. These alarms are then correlated to each other by using a database that describes the network topology; this database defines the equipments that implement the network and the connections between equipments. These equipment connections define the routing of circuits and trunks through the network. The topology database determines: which trunk or ordered sequence of trunks contains a given circuit; which circuits are contained within a given trunk; and the topological route through the network for any given circuit or trunk. Using this knowledge of network topology, significant fault alarm events (that is, those most indicative of the location of a failure) can be distinguished from "sympathetic" events (those fault indicators propagated downstream and to lower multiplex levels from a failure) to determine the topological point of failure as accurately as possible. Moreover, correlated alarms on multiple circuits contained within the same trunk can be used to infer an outage on that trunk, even if no fault alarms have been received from the trunk equipment. Or, if direct alarms have been reported on a trunk, then corresponding alarms on the circuits contained in that trunk serve to confirm that a traffic-affecting outage has occurred (whereas the absence of circuit alarms might indicate that the circuits have been rerouted).

The results of the present process are automatically displayed to the Fault Management System users, and all input fault alarm data that was used to determine each outage is also available to the user upon request. The outage extent and location information is most immediately useful for initiating a traffic restoration plan and for directing the attention of field repair efforts. The outage information for a trunk is also useful for determining the impact of the outage to customer circuits. This information can be used "pro-actively" (by notifying the affected customers), or for correlating a customer-reported problem to an outage.

BRIEF DESCRIPTION OF THE FIGURES

The above-mentioned objects and advantages of the present invention will be more clearly understood when considered in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram depicting the Fault Management System environment in which the invention operates.

FIGS. 2A-2F present a flow chart of operative steps carried out in a preferred embodiment of the invention. (It should be understood that these diagrams are not represented as a complete set of flow charts such as might be prepared for a specific implementation of the invention; rather, the charts present the primary operative logic, whereas some of the processing details that would be required for the depicted operations are indicated or implied in the following Detailed Description.)

FIG. 3 is a schematic representation of a DS-3 circuit traversing two fiber-optic transmission systems with a DXC-3/3 (a DXC with DS-3 inputs and outputs) between the two systems, which will be used to describe the pertinent features of the topology database.

FIG. 4 is a schematic representation of several circuits carried by several trunks, which demonstrates some of the network topological relationships and considerations that will be discussed in the following Detailed Description of the invention.

DETAILED DESCRIPTION OF THE INVENTION

Telecommunications networks are typically monitored by Remote Monitoring Systems (RMSs), shown as 105 in FIG. 1, collocated or in close geographic proximity with the network equipments 107. The equipments of interest for this invention include digital cross-connects (DXCs 108), light terminating and regeneration equipments (LTEs 110 and LREs 114), radio transmitters and repeaters (RADs 118 and RPTs 116), and multiplexers (MUXs 120). The RMS devices mediate the exchange of data 106 with these network equipments. Fault alarms, status information and performance statistics are collected from the equipments, and various control commands are sent to the equipments. These RMS devices report the collected data 104 to a central Fault Management System (FMS) 101, and the devices receive control commands from users or from automated processes on the FMS. The FMS will provide a user fault alarm display and control command interface 100.

The RMS devices are designed to communicate with the disparate equipments using a wide variety of communications protocols and data formats. These RMSs must also exchange data with the FMS in a fairly standard format (although the precise content of exchanged messages will typically be specific for a particular type of equipment). The RMS will time-stamp the fault alarm messages before sending them to the FMS. The fault occurrence times are required by the outage analysis process described herein. Therefore, the clocks on all the remote monitoring devices must be synchronized to some known accuracy.

This invention includes a process that executes, continuously and automatically, on the central FMS. Such FMS systems typically contain a process that receives messages from the RMSs, recognizes the device-specific format of each message, and extracts individual data elements from the message for the convenience of other processes executing on the FMS (alarm reporting, for example). This invention contains an interface, shown in FIG. 2A, to the message reception process to extract only certain selected fault alarm messages as indicated in step 201: that is, those fault alarms indicating a circuit or trunk traffic outage, plus the messages that indicate that such a fault condition has now "cleared".

The first step in the process implemented by this invention is to maintain a database of all active fault alarms 103, using the equipment-identifying data (extracted from the alarm message) as the index key. For equipments that handle multiple circuits, such as multiplexers and cross-connects, the alarm data must also include other information (a port identifier, for example) to indicate which specific circuit on that equipment is in alarm. New alarms (step 202) are added to this database at step 203, and cleared alarms are removed at step 204.
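The active alarm database described above can be sketched as a keyed in-memory store. This is a minimal illustration only: the `Alarm` and `ActiveAlarms` names, the field layout, and the use of a Python dictionary are assumptions, and keeping a single alarm per equipment point is a simplification of whatever a real FMS would store.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical index key: site, equipment type, unit identifier, optional port.
AlarmKey = Tuple[str, str, str, Optional[str]]


@dataclass
class Alarm:
    site: str
    equipment_type: str
    unit_id: str
    port: Optional[str]
    alarm_type: str
    timestamp: float          # time-stamp affixed by the RMS
    cleared: bool = False     # True for a "cleared" message

    @property
    def key(self) -> AlarmKey:
        return (self.site, self.equipment_type, self.unit_id, self.port)


class ActiveAlarms:
    """Keeps one active alarm per equipment connection point (a simplification)."""

    def __init__(self) -> None:
        self._alarms: Dict[AlarmKey, Alarm] = {}

    def process(self, alarm: Alarm) -> bool:
        """Apply steps 202-204; return True if a new alarm was stored."""
        if alarm.cleared:
            self._alarms.pop(alarm.key, None)   # step 204: remove cleared alarm
            return False
        self._alarms[alarm.key] = alarm         # step 203: add new alarm
        return True

    def __len__(self) -> int:
        return len(self._alarms)
```

A caller would pass each selected message to `process` and continue with the topology search only when it returns `True`.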

For new alarms (202) that are added to the active alarm database (203), the next step 211 (FIG. 2B) in this process is to use the equipment-identifying data to search the network topology database 102.

Reference is now made to FIG. 3 to describe the pertinent structure of the topology database. This schematic shows a DS-3 circuit (301) between two M13s (DS-1-to-DS-3 multiplexers), 307 and 315. From M13 307, the DS-3 is directly connected to a port on the Light Terminating Equipment (LTE) 308, which is at the same location as M13 307. (LTE 308 will combine several such DS-3 circuits, the number depending on the transmission speed of the system.) This fiber optic system (305) terminates at LTE 310, with one Light Regenerating Equipment (LRE) station (309) between the two terminating sites. The DS-3 circuit is then connected to a port on a DXC-3/3 (311), which is at the same site as LTE 310. From the appropriate cross-connected port on DXC 311, the DS-3 is connected to another LTE (312) at the same site, which constitutes one end of fiber-optic system 306. System 306 passes through LRE 313 and terminates at LTE 314. The DS-3 circuit is then connected from the appropriate port on LTE 314 to M13 315, which terminates the DS-3.

The topology database can identify circuit 301 with an arbitrary but unique number. A circuit record with this key will identify the "left" and "right" circuit end points, which in this example are the M13s 307 and 315 respectively. In a separate circuit "segment" table, the database will define the equipment connections necessary to build the DS-3 circuit through the network. In the diagram, circuit 301 is composed of three segments: 302, 303, and 304. These segment records are numbered in the "left to right" direction. (The choice of which end is "left" and which "right" is arbitrary; it is only necessary that the numbering represents the true physical ordering of the DS-3 connection points.) In each case, a segment consists of a pair of DS-3 ports. For example, segment 302 consists of a "left" DS-3 port on LTE 308 and a "right" DS-3 port on LTE 310. That is, segment 302 represents the entry and exit points for DS-3 circuit 301 as it traverses the fiber-optic trunk 305, and there are no other DS-3 level connections for the circuit between these two points. Circuit segment 304 is similar for traversing trunk 306. Segment 303 represents the "left" and "right" DS-3 ports of the cross-connection made within DXC 311.

The database entry for segment 302 will contain an explicit reference to fiber-optic trunk 305 as the carrier for that segment, and segment 304 will reference trunk 306 as its carrier. Segment 303 is simply a cross-connection, with no associated trunk.

The topology data elements necessary for this invention are thus:

Circuit table:

Circuit identifier

Circuit capacity (e.g. DS-3)

Left end site identifier

Left end equipment type (e.g. M13)

Left end equipment uniqueness identifier at that site

Left end equipment circuit identifier (e.g. port number)

Right end site identifier

Right end equipment type

Right end equipment uniqueness identifier at that site

Right end equipment circuit identifier

Segment table:

Circuit identifier

Segment sequence number

Left side site identifier

Left side equipment type

Left side equipment uniqueness identifier at that site

Left side equipment circuit identifier

Right side site identifier

Right side equipment type

Right side equipment uniqueness identifier at that site

Right side equipment circuit identifier

Circuit identifier of any carrier trunk for the segment
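The two tables listed above might be represented as follows. This is a sketch under stated assumptions: the class names, the site labels "X", "Y" and "Z", and the port identifiers are hypothetical, and the FIG. 3 reference numerals are reused as identifiers purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Port:
    site: str
    equipment_type: str
    unit_id: str        # uniqueness identifier of the equipment at that site
    port_id: str        # equipment circuit identifier, e.g. a port number


@dataclass
class Circuit:
    circuit_id: int
    capacity: str
    left_end: Port
    right_end: Port


@dataclass
class Segment:
    circuit_id: int
    sequence: int                        # numbered in the left-to-right direction
    left: Port
    right: Port
    carrier_trunk_id: Optional[int] = None


# Circuit 301 of FIG. 3, with hypothetical sites and port numbers.
CIRCUIT_301 = Circuit(301, "DS-3",
                      Port("X", "M13", "307", "1"), Port("Z", "M13", "315", "1"))
SEGMENTS: List[Segment] = [
    Segment(301, 1, Port("X", "LTE", "308", "5"), Port("Y", "LTE", "310", "5"), 305),
    Segment(301, 2, Port("Y", "DXC", "311", "12"), Port("Y", "DXC", "311", "40")),
    Segment(301, 3, Port("Y", "LTE", "312", "2"), Port("Z", "LTE", "314", "2"), 306),
]


def find_point(segments: List[Segment], port: Port) -> Optional[Tuple[int, int, str]]:
    """Return (circuit id, segment sequence number, side) for a connection point."""
    for seg in segments:
        if seg.left == port:
            return (seg.circuit_id, seg.sequence, "left")
        if seg.right == port:
            return (seg.circuit_id, seg.sequence, "right")
    return None
```

A production system would presumably index these records in a database rather than scan a list; the linear search is used here only to keep the sketch short.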

Returning to the flow chart in FIG. 2B, the received alarms will identify the specific equipment reporting the alarm. Generally, these data are: the equipment location or site; the equipment type (which implies function, capacity, etc.); a uniqueness identifier for that equipment type at that site; and usually a port number on that equipment. The topology database search 211 will attempt to find this connection point, either as an end point of a circuit or of a segment. The results of this database search will include: the identifier for the circuit (301) associated to that equipment connection point; the segment sequence number (which indicates, as explained above, the relative location of that equipment point along the circuit route); and a direction indicator (which is also relative to the other equipment points along the circuit route). These data are sufficient for correlating all fault alarms along the same circuit. That is, when evaluating the significance of a given alarm, it is not necessary to query the database for the full topology of the circuit, then check to see if any alarms have been received for any of those points. Rather, if any other alarms are associated to the same circuit identifier, then it is possible to relate the given alarm to the others, topologically, using the sequence numbers and direction indicators found in the database. Specifically, for the purpose of evaluating a given alarm's significance, it is possible to determine whether any of these conditions exist: another alarm is "upstream" from and in the same traffic direction as the given alarm point; another alarm is "upstream" in the opposite direction (or "return path") from the given point; another alarm is on the next adjacent circuit segment, either in the same or opposite direction; or another alarm exists on the opposite side of the same circuit segment as the given point.
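The four topological relations listed above can be derived from sequence numbers and direction indicators alone, as the following sketch suggests. The encoding is an assumption made for illustration: each alarm point is given a `side` ("left" or "right") and a traffic `direction` ("L2R" or "R2L"), and "upstream" is taken to mean earlier along the given alarm's own traffic direction. The actual direction indicator stored in the topology database may be encoded differently.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class AlarmPoint:
    circuit_id: int
    segment_seq: int   # segment sequence number, numbered left to right
    side: str          # "left" or "right" port of that segment
    direction: str     # direction of the affected traffic: "L2R" or "R2L"

    @property
    def position(self) -> int:
        """Ordinal position of the port along the circuit, left to right."""
        return 2 * self.segment_seq + (1 if self.side == "right" else 0)


def relate(given: AlarmPoint, other: AlarmPoint) -> Set[str]:
    """Return the relations that `other` has with respect to `given`."""
    relations: Set[str] = set()
    if given.circuit_id != other.circuit_id:
        return relations
    # Upstream means "earlier in the given alarm's traffic direction".
    if given.direction == "L2R":
        is_upstream = other.position < given.position
    else:
        is_upstream = other.position > given.position
    if is_upstream:
        if other.direction == given.direction:
            relations.add("upstream_same_direction")
        else:
            relations.add("upstream_return_path")
    if abs(other.segment_seq - given.segment_seq) == 1:
        relations.add("adjacent_segment")
    if other.segment_seq == given.segment_seq and other.side != given.side:
        relations.add("opposite_side_same_segment")
    return relations
```

No query for the full circuit route is needed: two alarms that share a circuit identifier are compared using only the values returned by the search for each of them.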

In step 212 the active alarm that is being processed in step 203 is added to the Active Alarms Database.

The analysis of a given fault alarm (213) is largely dependent upon the precise behavior of the reporting equipment. Therefore, it is necessary to consider the device type and the alarm type in determining the significance of a given alarm. This part of the analysis might be implemented with "truth tables" or with a rule-based "inference engine", or a combination of both. Each truth table entry or rule condition must first specify an alarm and device type to be operated upon and the conditions necessary to consider that alarm "significant" (with respect to any other alarms present on the same circuit). (Although the analysis process might also be implemented directly in code, using truth tables or inference engine rules allows ease of maintenance when new alarm analysis requirements are identified by the system users or when new equipment types are added to the network. The topology-based fault analysis implemented by this invention will be described in very general terms, but in practice it is highly desirable to have precise processing and reporting control over each specific alarm produced by each particular type of equipment.)

Of consideration in implementing this analysis process is the difficulty of making all rules or truth table entries mutually exclusive, such that one and only one rule condition or table entry will be found to be true. Constructing each rule such that it is explicitly exclusive of all others would require constructing each with full knowledge of all the other rules; this would be tedious and would create a potential for problems if any rules need to be changed in the future. It is therefore desirable to arrange that rules or truth table entries will be evaluated in some prioritized order, such that the "most significant" result will be derived first and any other possibilities will not be evaluated. In general, the "most significant" result is the one that is "most informative" or is indicative of the "most serious" problem, but the choice is sometimes arbitrary.

Reference is now made to FIG. 2C, which depicts the effective algorithm implemented by the first level of fault analysis rules. In these rules, one important distinction to be made in alarm types is that between signal transmission (output) faults and signal reception (input) faults (231). A transmission fault needs no further analysis by the topological correlation process. The exact cause, location, and traffic-bearing impact of the fault are fully known. A transmission fault is therefore reported immediately as a significant event (232). One of the rules for analyzing DS-3 DXC equipment alarms, for example, would have this general form:

    IF a "lost transmit output" alarm occurs on a DXC DS-3 port THEN report the alarm as a significant event.

Some equipments will detect and report a signal reception failure (e.g., "loss of signal") that can only be caused by a circuit fault immediately upstream from the equipment. Such alarms are generally taken as significant for determining the location of the failure (233). Unlike the transmission fault alarms, however, the cause and precise location of the problem are not known, so these alarms need to be correlated with other alarms to determine if a higher-order outage can be inferred (as explained below). This analysis is done in step 234. The DS-3 DXC rule for processing this type of alarm might have this (somewhat simplified) form:

    IF a "loss of signal" alarm occurs on a DXC DS-3 port THEN analyze for possible trunk alarm upstream.

If the alarm message indicates only that some signal fault has been detected somewhere upstream (e.g., "alarm indicator signal" in the same direction or "return alarm indicator" in the opposite direction), then the source of the problem could be in any of the upstream equipments or transmission media, either at the same multiplex level (i.e., on the same circuit) or at any higher level (i.e., on any trunk containing the circuit). Determining whether or not such an alarm is a significant event for locating the fault requires that the truth table entry or inference engine rule be able to test for other alarms upstream from the given alarm (235). A typical rule for DS-3 DXC equipment might be:

    IF an "alarm indicator signal" occurs on a DXC DS-3 port AND no other fault alarms exist upstream on the same DS-3 circuit THEN analyze for possible trunk alarm upstream.

(In a "truth table" implementation, each distinct test necessary for any rule would be represented as a column in a table, and each rule would specify a value of "true", "false", or "don't care" for each of these columns. An additional column would specify an action to be taken if all of these condition specifications are satisfied when the rule is applied to a given alarm.)
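A truth table of this kind, evaluated in prioritized order so that the first matching entry wins, might be sketched as follows. The `Rule` structure, the column name `upstream_alarm_on_circuit`, and the action names are hypothetical. The first three entries restate the three example DXC rules given above; the final "suppress" entry is an assumption added to show how a lower-priority catch-all entry avoids the need for mutually exclusive rules.

```python
from typing import Dict, List, NamedTuple, Optional


class Rule(NamedTuple):
    device_type: str
    alarm_type: str
    conditions: Dict[str, Optional[bool]]   # True, False, or None for "don't care"
    action: str


# Entries are listed in priority order; evaluation stops at the first match.
RULES: List[Rule] = [
    Rule("DXC DS-3", "lost transmit output", {}, "report_significant"),
    Rule("DXC DS-3", "loss of signal", {}, "analyze_upstream_trunk"),
    Rule("DXC DS-3", "alarm indicator signal",
         {"upstream_alarm_on_circuit": False}, "analyze_upstream_trunk"),
    # Assumed catch-all: an indicator with an upstream alarm is sympathetic.
    Rule("DXC DS-3", "alarm indicator signal",
         {"upstream_alarm_on_circuit": None}, "suppress_as_sympathetic"),
]


def evaluate(rules: List[Rule], device_type: str, alarm_type: str,
             facts: Dict[str, bool]) -> Optional[str]:
    """Return the action of the first rule whose columns all match the facts."""
    for rule in rules:
        if rule.device_type != device_type or rule.alarm_type != alarm_type:
            continue
        if all(want is None or facts[name] == want
               for name, want in rule.conditions.items()):
            return rule.action
    return None
```

Because evaluation stops at the first match, the catch-all entry does not have to restate the negation of the entry above it.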

Note that alarms are received one at a time in an unpredictable order. It is therefore not desirable to check immediately for any upstream alarm conditions; they might not have been processed yet. A small delay time should be introduced into the evaluation of these rules to allow any possibly correlated alarms to be processed.

Also note that any alarms that are actually correlated to the same problem should have approximately the same reporting time-stamp (affixed by the RMS devices), plus or minus the device-dependent alarm reporting latency and the maximum difference between remote clock times. This time-stamp correlation condition should also be specified by the alarm analysis rules.
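The two timing conditions just described reduce to simple comparisons. In the sketch below the function and parameter names are hypothetical, times are in seconds, and the correlation window is assumed to be the sum of the reporting latency and the maximum clock difference.

```python
def may_be_correlated(t1: float, t2: float,
                      reporting_latency: float, max_clock_skew: float) -> bool:
    """True if two RMS time-stamps are close enough to share a cause."""
    return abs(t1 - t2) <= reporting_latency + max_clock_skew


def ready_to_evaluate(received_at: float, now: float, settle_delay: float) -> bool:
    """True once the small delay allowed for correlated alarms has elapsed."""
    return now - received_at >= settle_delay
```

For example, with a 2.0 second latency and 1.5 seconds of clock difference, alarms stamped 3.0 seconds apart could still be correlated, while alarms 4.0 seconds apart would not be.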

In FIG. 2C, any circuit alarm that appears to be significant by test 235 is also presented to process 234 for analysis of possible upstream trunk faults. The flow chart for process 234 is shown in FIG. 2D.

The first step in this analysis (241) is to determine if the given alarm is on a transmission system at the highest multiplex level (which therefore is not carried by any higher-level trunks), or on some lower level circuit. Any circuit alarms below the transmission system level (that is, the DS-3 level and below) are presented to a multiplex-hierarchy level of analysis.

This hierarchy analysis attempts to correlate circuit alarms to any upstream trunk alarms or to infer trunk failures from alarms on the contained circuits. The topology database is searched (242) to get a list of all trunks upstream from the reported fault alarm. (Referencing FIG. 3 again for an example, if an alarm is received on the "right" side of DXC 311, which is in segment 303 of circuit 301, then all higher segments for 301 would be retrieved and all trunk associations for those segments would be returned, which in this example would only be trunk 306.) A failure on any of these upstream trunks could be the cause of the circuit alarm indicator signal.
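The upstream trunk search of step 242 can be illustrated with the FIG. 3 example. The sketch assumes that an alarm on a "right" port looks toward higher-numbered segments, as in the example above, and that, by symmetry, an alarm on a "left" port looks toward lower-numbered segments. The symmetric case and the tuple layout are assumptions, not statements from the description.

```python
from typing import List, Optional, Tuple

# (segment sequence number, carrier trunk id or None) for circuit 301 in FIG. 3:
# segment 302 rides trunk 305, segment 303 is a cross-connection, 304 rides 306.
CIRCUIT_301_SEGMENTS: List[Tuple[int, Optional[int]]] = [
    (1, 305), (2, None), (3, 306),
]


def upstream_trunks(segments: List[Tuple[int, Optional[int]]],
                    alarm_seq: int, alarm_side: str) -> List[int]:
    """Return the carrier trunks of the segments beyond the alarmed port."""
    if alarm_side == "right":
        beyond = [s for s in segments if s[0] > alarm_seq]
    else:
        beyond = [s for s in segments if s[0] < alarm_seq]
    # Cross-connection segments have no carrier trunk and are skipped.
    return [trunk for _, trunk in beyond if trunk is not None]
```

With the cross-connection as the second segment, `upstream_trunks(CIRCUIT_301_SEGMENTS, 2, "right")` returns only trunk 306, matching the example.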

Each of these upstream trunks is processed in turn (243). On each trunk, a circuit alarm counter is incremented (244). The directionality of the circuit alarm with respect to the trunk is significant and separate counters must be maintained for circuit alarms in each direction. (For simplicity, this complication will generally be omitted from the following discussion; but in all references to circuit alarm counts on trunks, each direction on the trunk must be considered separately.) This circuit alarm count serves two purposes: first, if an explicit fault alarm is reported for that trunk, then the presence of alarms on the contained circuits provides a confirmation that the trunk fault is actually causing a traffic outage; and second, a fault on a trunk can be inferred if a majority of the circuits on that trunk report alarms.

For efficiency in later processing, some additional processing (245) can be performed as the circuit alarm is counted on each of the upstream trunks. If the circuit alarm is the first alarm to be counted on a given trunk, or if the time-stamp of the alarm falls outside the window for presuming correlation with any previous alarms, then the time-stamp of that alarm and the set of all upstream trunks are stored in the data structure representing the trunk. Otherwise, if the circuit alarm is not the first one to be counted on a given trunk and the time-stamp of that alarm is within the window necessary for presuming correlation with the previous alarms, then the set of upstream trunks for the new alarm is intersected with that of the previous alarm or alarms (that is, all trunks common to both sets are extracted), and the new list is stored in the trunk data structure. This intersection set will be referred to as the "common path set" for the circuits on the trunk: at any given time, this is the set of trunks that contain all of the same circuits as those counted on the given trunk. (This set always contains the given trunk itself, and it may contain only that trunk if the circuits do not have any other trunks in common.) The significance of this common path set is that the circuit alarms counted on the given trunk could actually be caused by an outage on any of these trunks.
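Steps 244 and 245 can be sketched as a single update applied to every trunk upstream of a circuit alarm. The `TrunkState` structure and its field names are hypothetical, per-direction counters are omitted as in the discussion above, and the sketch does not reset the counter when a new correlation window begins, since the description does not specify that behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class TrunkState:
    alarm_count: int = 0
    window_start: Optional[float] = None          # stored alarm time-stamp
    common_path: Set[int] = field(default_factory=set)


def count_circuit_alarm(trunks: Dict[int, TrunkState], upstream: Set[int],
                        timestamp: float, window: float) -> None:
    """Count one circuit alarm on each upstream trunk and update its common path set."""
    for trunk_id in upstream:
        state = trunks.setdefault(trunk_id, TrunkState())
        state.alarm_count += 1                                  # step 244
        first = state.window_start is None
        if first or abs(timestamp - state.window_start) > window:
            # First alarm, or outside the window: store time-stamp and full set.
            state.window_start = timestamp
            state.common_path = set(upstream)
        else:
            # Correlated alarm: keep only the trunks common to both sets.
            state.common_path &= upstream
```

Applied to FIG. 4 under the assumption that one alarm from group 401 reports trunks {403, 405, 406, 407} as upstream and one alarm from group 402 reports {404, 405, 406, 408}, the common path set for trunk 405 becomes {405, 406}, while trunk 403 keeps the full set of the single alarm counted on it.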

Every time that a circuit alarm counter is incremented on a given trunk, then that trunk is evaluated (246) to determine if a fault can be inferred from the circuit alarms or if a reported trunk fault can be confirmed to be affecting traffic on the contained circuits.

For maximum flexibility, inferring or confirming a trunk outage from circuit alarms can also be handled with a rule-based inference engine. For example, rules might be written that specify different circuit alarm thresholds depending on the total number of monitored circuits (which may not be all of the circuits carried by the trunk). These rules might require that there be a minimum number of monitored circuits on a trunk for any outage to be inferred, and decreasing percentage thresholds might be specified for increasing monitored circuit counts. These thresholds should be set low enough that outages will be inferred even if some alarms do not get reported for some reason, yet high enough that false inferences are not very probable. False inferences are made much less probable if, again, the time-stamp of each circuit alarm is taken into consideration by the rules: if a trunk failure has caused the circuit alarms, then the circuit alarm times should be within the previously mentioned time frame. Special rules might also be written for specific equipment types that take into account any alarms or combinations of alarms unique to that equipment type.
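A threshold schedule of the kind described, with a minimum monitored-circuit count and percentages that decrease as the count grows, might look like the following. The specific numbers in `THRESHOLDS` are invented for illustration only and are not taken from the description; time-stamp checks are omitted.

```python
from typing import List, Optional, Tuple

# Hypothetical schedule: (minimum monitored circuits, fraction required in alarm).
# Entries are ordered from the largest circuit count to the smallest.
THRESHOLDS: List[Tuple[int, float]] = [(24, 0.30), (12, 0.40), (4, 0.51)]


def inference_threshold(monitored: int) -> Optional[float]:
    """Return the required alarm fraction, or None if too few circuits are monitored."""
    for min_monitored, fraction in THRESHOLDS:
        if monitored >= min_monitored:
            return fraction
    return None


def can_infer_outage(alarmed: int, monitored: int) -> bool:
    """True if enough monitored circuits are in alarm to infer a trunk outage."""
    fraction = inference_threshold(monitored)
    return fraction is not None and alarmed / monitored >= fraction
```

Under this illustrative schedule, 8 alarms out of 24 monitored circuits would be enough to infer an outage, whereas 4 out of 12 would not, and a trunk with fewer than 4 monitored circuits would never produce an inference.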

This process, Evaluate Circuit Alarm on Trunk (246), is described in FIG. 2E. Step 261 determines if any fault has already been directly reported (rather than inferred) for that trunk. If so, then the circuit alarm count is compared to some specified "confirmation threshold" value (say for example, 51% or "more than half") in step 268. That is, if a trunk fault has been reported and a sufficient number of circuit alarms have also been reported on that trunk, then the circuit alarms are assumed to confirm that the trunk fault is affecting traffic. If the trunk fault status is not already set to "confirmed" (269), then it is so set in step 270.

If no direct fault has been reported (261), then the current circuit alarm count on the trunk is compared to some specified "inferred fault threshold" (say again, 51% or "more than half") in step 262. The inference is that if a sufficient number of circuit alarms have been reported, then a trunk fault can be assumed. In that case, the trunk fault status is set to "inferred" in step 263.
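The confirm-or-infer decision of FIG. 2E can be sketched as below. The `Trunk` structure and status strings are hypothetical, "more than half" is implemented as a strict comparison against half of the monitored circuits, and the additional minimal-set check of step 264 is deliberately left out of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Trunk:
    trunk_id: int
    monitored_circuits: int
    circuit_alarms: int = 0
    direct_fault: bool = False     # a fault reported by the trunk equipment itself
    status: str = "none"           # "none", "confirmed", or "inferred"


def over_threshold(trunk: Trunk, fraction: float = 0.5) -> bool:
    """True if strictly more than `fraction` of monitored circuits are in alarm."""
    return trunk.circuit_alarms > trunk.monitored_circuits * fraction


def evaluate_circuit_alarm_on_trunk(trunk: Trunk,
                                    confirm_fraction: float = 0.5,
                                    infer_fraction: float = 0.5) -> str:
    """Apply steps 261-263 and 268-270, then return the trunk fault status."""
    if trunk.direct_fault:                                   # step 261
        if over_threshold(trunk, confirm_fraction) and trunk.status != "confirmed":
            trunk.status = "confirmed"                       # steps 268-270
    elif over_threshold(trunk, infer_fraction):              # step 262
        trunk.status = "inferred"                            # step 263
    return trunk.status
```

This function would be called each time a circuit alarm counter is incremented on the trunk.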

In fact, if a trunk has sufficient circuit alarms to infer a trunk fault, it is not necessarily a fault on that particular trunk; the circuit alarms could be caused by a fault on any of the trunks that contain the same set of circuits. It is often the case that a given set of circuits traverse the same set of trunks between monitoring points. In such cases, several trunks may cross their circuit-alarm thresholds because of the same set of circuit alarms. Moreover, if one trunk has reached its circuit-alarm threshold, it is possible that some other trunk containing the same circuits has a directly reported fault, so a fault ought not be inferred on the given trunk. For these reasons, whenever a trunk fault might be inferred from circuit alarms only, an additional step, 264, is necessary to determine the minimal set of trunks on which faults would explain all known circuit alarms.

The general logic of this minimal-trunk-fault determination, which is invoked only for inferred trunk outages, is this:

1. If the set of circuits that are in alarm for a given trunk are also contained in another trunk or set of trunks on which trunk faults have been directly reported, then no outage should be inferred on the given trunk.

2. If the set of circuits that are in alarm for a given trunk are a proper subset of the circuits in alarm for another trunk which has crossed its circuit alarm threshold, then no outage should be inferred on the given trunk.

Otherwise, the trunk should be reported as a possible outage. Referenceis now made to FIG. 4, which is schematic representation of severalcircuits carried by several trunks, to explain these considerations.Assume that 401 is a group of circuits that begin at site A and arecarried by trunk 403 to site C, then by trunk 405 to site D, trunk 406to site E, and finally by trunk 407 to site F. Another group ofcircuits, 402, begins at site B and are carried by trunk 404 to site C,then by trunk 405 to site D, trunk 406 to site E, and finally by trunk408 to site F. Furthermore, assume that the circuit groups 401 and 402have reported circuit alarms, such that all six of the trunks shown inthe diagram have sufficient alarms (according to the specified thresholdcount) to infer that some trunk fault is causing the circuit alarms.

The reasoning behind the first rule above is simply that if a set ofcircuits passes through a trunk which has a directly reported fault,then no other trunk outages should be inferred from that set of circuitalarms. For example, suppose that trunk 405 has a directly reportedfault. When trunks 403, 404, 406, 407, and 408 are individuallyevaluated by step 264 in FIG. 2E (because each is over its circuit alarmthreshold), the trunk fault on 405 should be recognized as the probablecause of all the circuit alarms on each of those trunks, and no othertrunk fault should be inferred.

The reasoning behind the second rule above is that if another trunkcontains the same circuits plus one or more additional circuits that arein alarm (and the additional alarms occurred within the same timeframe), then an outage on this other trunk would explain the circuitalarms on the given trunk, but not vice versa. (When the other trunk isevaluated, it will be reported as an inferred outage because Rule 2 isnot true.) For example in FIG. 4, assume that no trunks have reportedany faults. When trunk 403 is evaluated, the process should recognizethat the same circuits (401) pass through trunks 405, 406, and 407. Now,a fault on either trunk 405 or 406 would explain all of the circuitalarms in group 401, but a fault on trunk 403 would not explain any ofthe circuit alarms in group 402, which are also present on trunks 405and 406. Therefore, when trunk 403 is evaluated, no fault should beinferred on that trunk. The same is true when trunks 404, 407, and 408are evaluated: trunks 405 and 406 carry the same circuits reportingalarms plus additional circuit alarms that cannot be explained by faultson any of those trunks.

If neither of these rules applies to a given trunk, then an inferred outage should be reported for the trunk. However, it is undesirable to report several separate trunk outages that were all inferred from the same set of circuit alarms; it is more accurate (and less confusing to the users) to report that a trunk outage has been inferred, and to include in that one report all possible trunks implicated by the same set of circuit alarms. In FIG. 4, when trunk 405 is evaluated, neither of the above rules applies, so a trunk fault should be inferred for 405, but since trunk 406 carries exactly the same set of circuits, a single fault should be reported to the FMS users which indicates that either 405 or 406 could be the location of the fault.

An algorithm that implements these considerations is described in the flowchart (FIG. 2F). As noted previously, each time that a circuit alarm has been counted on a given trunk, the set of trunks upstream from that circuit point has been intersected with the similar set of trunks for any other circuit alarms on that trunk to form the "circuit common path" set; these are the only trunks that need to be examined.
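The bookkeeping that forms the circuit common path set can be sketched as follows. This is an illustration only: the class name `TrunkRecord`, its fields, and the string trunk identifiers are assumptions introduced here, the alarm time window is ignored, and the example assumes that the circuit alarms of FIG. 4 are reported at site F, so that every trunk on each circuit is upstream of the alarm point.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrunkRecord:
    """Per-trunk bookkeeping; all names here are illustrative assumptions."""
    trunk_id: str
    alarm_count: int = 0
    common_path: Optional[frozenset] = None  # None until the first alarm is counted

    def count_circuit_alarm(self, upstream_trunks):
        """Count one circuit alarm and narrow the circuit common path set."""
        upstream = frozenset(upstream_trunks)
        self.alarm_count += 1
        if self.common_path is None:
            self.common_path = upstream      # first alarm: store its upstream set
        else:
            self.common_path &= upstream     # later alarms: keep shared trunks only
        return self.common_path


# Trunk 405 in FIG. 4 carries circuits from both groups.
record = TrunkRecord("405")
record.count_circuit_alarm(["403", "405", "406", "407"])  # a circuit in group 401
record.count_circuit_alarm(["404", "405", "406", "408"])  # a circuit in group 402
print(sorted(record.common_path))  # ['405', '406']
```

Because the set is narrowed on every counted alarm, the evaluation in FIG. 2F never has to revisit the individual circuit routes.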

First, in step 501 each of the trunks in the circuit common path set for the given trunk is examined to determine if any of these has already reported a trunk alarm; if so, by Rule 1 above, no fault should be inferred on the trunk being evaluated, so no further action is taken.

Otherwise, in step 502 the common path set of trunks is shortened to include only those that are also over their alarm-count thresholds (which list always includes, at least, the trunk under evaluation). The assumption made here is that any trunk that is not over its circuit-alarm threshold is probably not the location of the inferred outage. The final result set is initialized to be this same set in step 503. To determine the minimal set of trunks that should be included in the outage report (that is, possible fault locations), each of the trunks in the shortened common path set is then examined (504). The common path set for each (compiled for the circuit alarms on that trunk) is likewise shortened to include only those that are also over their thresholds (505). The intersection across all of these sets (that is, those trunks that are in all of the lists), step 506, produces the desired result of satisfying Rule 2 above, and it also allows a single outage report to be made listing all possible trunks that might be causing the inferred outage. Note that this intersection set may be a single trunk or several trunks, and it may or may not include the trunk that is currently under evaluation. If there are several trunks in the set, these may be contiguous in the network topology (as are trunks 405 and 406 in FIG. 4), but they are not necessarily so because there might be intervening trunks that are not over their alarm threshold and which were therefore excluded.
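Steps 501 through 506 can be sketched as set operations. The function name, its parameters, and the dictionary layout below are hypothetical and are not taken from the flowchart; the data reproduces the FIG. 4 scenario under the assumption that all six trunks are over threshold and that the common path sets were compiled from alarms reported at site F.

```python
def infer_fault_location(trunk, common_path, over_threshold, direct_alarms):
    """Return candidate fault trunks for `trunk`, or None when Rule 1 applies.

    common_path    -- dict: trunk id -> circuit common path set for that trunk
    over_threshold -- set of trunks whose circuit-alarm count is over threshold
    direct_alarms  -- set of trunks that have directly reported a trunk alarm
    """
    path = common_path[trunk]
    # Step 501 (Rule 1): a directly reported fault on the common path
    # already explains the circuit alarms, so nothing is inferred.
    if path & direct_alarms:
        return None
    # Step 502: keep only trunks that are themselves over threshold.
    shortened = path & over_threshold
    # Step 503: initialise the result to the shortened set.
    result = set(shortened)
    # Steps 504-506: intersect with each member's own shortened set.
    for other in shortened:
        result &= common_path[other] & over_threshold
    return frozenset(result)


# FIG. 4: group 401 uses 403-405-406-407, group 402 uses 404-405-406-408.
g401 = frozenset({"403", "405", "406", "407"})
g402 = frozenset({"404", "405", "406", "408"})
common_path = {"403": g401, "407": g401, "404": g402, "408": g402,
               "405": g401 & g402, "406": g401 & g402}
over_threshold = set(common_path)

print(sorted(infer_fault_location("403", common_path, over_threshold, set())))
# ['405', '406']
print(infer_fault_location("403", common_path, over_threshold, {"405"}))
# None
```

Evaluating trunk 403 yields trunks 405 and 406 rather than 403 itself, which is the case noted above where the result does not include the trunk under evaluation.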

The result of process 264 is a list of one or more trunks that could be the location of a fault causing the observed circuit alarms. This list is returned to the Evaluate Circuit Alarm Counts on Trunk process in FIG. 2E. There may be many trunks which are over their circuit-alarm thresholds, which may or may not be in this list, but all of which can be explained by a fault in this set of trunks. Each of these trunks will be evaluated separately, and in fact each of these trunks may be evaluated several times as new circuit alarms are received. Therefore, a separate data structure needs to be maintained to record this inferred fault location. Specifically, this data structure will record whether or not the inferred fault has already been reported, and it will allow detection of any change that requires the report to be updated (such as any shortening or lengthening of the list of possible faulted trunks).

Step 265 compares the existing set of such data structures against the list of trunks produced in step 264 to determine if there is any match or partial match. If no intersection is found with any previously asserted outage, a new data structure is initialized in step 266 to represent the newly recognized trunk outage. A counter is initialized to one, representing the total number of trunks either directly involved in or indirectly explained by the new outage. This counter will be used to determine when the outage has cleared. (Note that the counter is initialized to one, which represents only the trunk currently under evaluation, even if multiple trunks are included in the list. This is because each of those trunks will be evaluated and counted separately.)

If any intersection is found with an existing inferred outage data structure, the association of the new outage to the existing one depends on the precise condition of the correlation between the two sets of trunks. There are four possibilities:

1. The newly asserted set of trunks is identical to a previously asserted outage. In this case, no new information about the outage has been determined except that an additional trunk is involved, so the list is unchanged in step 267 but the counter is incremented by one.

2. The newly asserted set of trunks is a proper subset of a previously asserted outage. In this case, the list is assumed to be a better estimation of the outage location (smaller in scope), so the new list of trunks can simply replace the old one in step 267. The trunk counter is again incremented by one.

3. The set of trunks in a previously asserted outage is a proper subset of the newly derived outage. In this case, the larger new set indicates that the inferred outage location needs to be increased in topological scope (because additional trunks have exceeded their threshold since the first evaluation). Again, the new list replaces the old one and the counter is incremented by one in step 267.

4. The newly asserted set of trunks only partially intersects a previously asserted outage list. In this case, the inferred outage location may need to be expanded, contracted, or simply shifted somewhat. This situation can be resolved by taking the union set of the new and the old lists, then for all of those trunks taking the intersection of all common path sets in the individual trunk data structures. Like the initial outage set determination, this is the minimal set that explains all the circuit alarms. This new list replaces the existing one in the inferred outage data structure in step 267 and the counter is incremented.
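The handling of steps 265 through 267 and the four cases above can be sketched as follows. The `InferredOutage` structure and the `reconcile` function are hypothetical names. The description does not say what should happen when a new list intersects more than one existing outage; this sketch simply updates the first match, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class InferredOutage:
    trunks: frozenset      # current list of possible fault locations
    trunk_count: int = 1   # trunks involved in or explained by this outage
    reported: bool = False


def reconcile(outages, new_trunks, common_path):
    """Apply steps 265-267 to the trunk list produced by step 264."""
    new_trunks = frozenset(new_trunks)
    for outage in outages:
        old = outage.trunks
        if not (old & new_trunks):
            continue                      # no intersection with this outage
        if new_trunks == old:
            pass                          # case 1: list unchanged
        elif new_trunks < old or old < new_trunks:
            outage.trunks = new_trunks    # cases 2 and 3: new list replaces old
        else:
            # Case 4: take the union of both lists, then intersect the
            # common path sets of every trunk in that union.
            merged = None
            for t in old | new_trunks:
                merged = common_path[t] if merged is None else merged & common_path[t]
            outage.trunks = frozenset(merged)
        outage.trunk_count += 1           # one more trunk explained by the outage
        return outage
    # Step 266: nothing matched, so start a new outage with its counter at one.
    outage = InferredOutage(trunks=new_trunks)
    outages.append(outage)
    return outage
```

Calling `reconcile` once per evaluated trunk keeps the counter equal to the number of trunks that have been attributed to the outage, which is what the clearing logic later relies on.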

An "inferred fault" alarm report can be created at this point (268). This inferred alarm can be inserted back into the main process of FIG. 2A and treated as a circuit alarm to be analyzed in relation to higher-order trunks. Such inferred alarms can be treated much like a directly reported circuit alarm except that there may be more than one circuit associated to the alarm (because the fault could not be narrowed to a single trunk at the current level).

Returning now to FIG. 2D, Analyze Trunk Alarm: When all upstream trunks from the reported circuit alarm point have been processed (243), then each of these upstream trunks is checked in step 246' to determine if any of those trunks has any fault, either directly reported in a trunk alarm or inferred from circuit alarms. If no faulted trunks are found, then the circuit alarm is assumed to be a significant, reportable event (247). Otherwise, the circuit alarm is ignored, since the upstream trunk fault explains the circuit alarm.

In step 247, a check should also be made to see if any lower-multiplex-level circuits contained within the subject circuit have already reported alarms (unless the subject circuit is at the lowest level processed by the system). If there are any, then the number of reported "sub-circuit" alarms should be compared to the confirmation threshold. If the number of such alarms is over the threshold, then the event can be reported as a "confirmed" outage; otherwise, it can be reported as "unconfirmed".
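The decision made in steps 246 and 247 can be summarized in a short sketch. The function name, its parameters, and the returned strings are illustrative assumptions. The description does not state how a circuit at the lowest processed level is labeled; here such a circuit is assumed to be passed with zero sub-circuit alarms and is therefore reported as unconfirmed.

```python
def evaluate_circuit_alarm(upstream_trunks, faulted_trunks,
                           sub_circuit_alarms, confirmation_threshold):
    """Classify a circuit alarm once all of its upstream trunks are processed."""
    # Step 246: an upstream trunk fault, whether directly reported or
    # inferred, explains the circuit alarm, so the alarm is not reported.
    if set(upstream_trunks) & set(faulted_trunks):
        return "ignored"
    # Step 247: the alarm is significant; use sub-circuit alarms to confirm it.
    if sub_circuit_alarms > confirmation_threshold:
        return "confirmed"
    return "unconfirmed"


print(evaluate_circuit_alarm(["403", "405"], {"405"}, 5, 3))  # ignored
print(evaluate_circuit_alarm(["403", "405"], set(), 5, 3))    # confirmed
print(evaluate_circuit_alarm(["403", "405"], set(), 2, 3))    # unconfirmed
```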

Returning to the top of FIG. 2D: If the input alarm is on a transmission system at the highest multiplex level (241), then a different type of analysis is implemented. This analysis intends to correlate outages on transmission systems that share a common physical route in the network. Although not part of the signal multiplexing hierarchy, fiber-optic cables and radio transmitter towers typically carry multiple transmission systems; therefore damage to a single cable or tower can affect multiple systems. For the benefit of the Surveillance and Restoration users of the FMS, such multiple-system outages should be combined into a single report.

To implement this system common-route analysis, the topology database is searched in step 248 to retrieve the ordered list of sites that the given system traverses. A separate set of data structures is maintained to represent the transmission system outages. One of these structures can represent a single system outage (if no other failed system shares the same network route) or several individual system outages that all share some common route in the network.

This set of system outage structures is searched in step 249 to determine if the route of the system being processed has any intersection with any other failed systems. This intersection is determined by comparing all adjacent site pairs; specifically, if any pair of consecutive sites in the system route is also a consecutive pair of sites in another system's route (in either order), then that site pair is part of the intersection between the two system routes.
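The adjacent-site-pair comparison can be sketched as follows. The function names and the site labels are invented for illustration; each pair is stored as an unordered set so that a match is found regardless of the direction in which either route lists its sites.

```python
def site_pairs(route):
    """Return the unordered adjacent site pairs of an ordered site list."""
    return {frozenset(pair) for pair in zip(route, route[1:])}


def route_intersection(route_a, route_b):
    """Site pairs that are consecutive in both routes, in either order."""
    return site_pairs(route_a) & site_pairs(route_b)


route_a = ["A", "C", "D", "E", "F"]
route_b = ["G", "E", "D", "C", "H"]   # shares C-D-E, listed in reverse
shared = route_intersection(route_a, route_b)
print(sorted(sorted(pair) for pair in shared))  # [['C', 'D'], ['D', 'E']]
```

Two routes that merely pass through the same site, without sharing a span between consecutive sites, produce an empty intersection under this comparison.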

If no such intersection is found, then a new system outage data structure is initialized (251) to represent the new system fault. If any intersection is found, then the new system outage is associated to it in step 250. If there is only a partial intersection between the given system route and a previously reported system outage, then the intersection list of sites is reset to include only the site pairs common to all systems associated to the multiple-system fault. (This intersection can only become shorter.) A counter is incremented to record the number of systems associated to the outage.

Whether or not any route intersection is found, the transmission system fault needs to be reported. Again for the benefit of the FMS users, this report should indicate if there are any unaffected systems along that network route. (Restoration is particularly interested in this information, since unaffected systems might be used for traffic restoration without having to find an alternate network path.) The topology database is searched in step 252 to retrieve all transmission systems that traverse the same set of sites, and any that are not already associated to the outage are presumed to be "unaffected" in the system outage report (253).
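Steps 249 through 253 can be sketched together. All names below are hypothetical. Two interpretations are assumed: the system counter is represented by the size of a set of system identifiers, and a system "traverses the same set of sites" when its route contains every site pair in the outage's common intersection.

```python
from dataclasses import dataclass, field


def site_pairs(route):
    """Unordered adjacent site pairs of an ordered site list."""
    return {frozenset(p) for p in zip(route, route[1:])}


@dataclass
class SystemOutage:
    common_pairs: set                          # pairs shared by all failed systems
    systems: set = field(default_factory=set)  # its size serves as the counter


def record_system_fault(outages, system_id, routes):
    """Steps 249-251: join a failed system to a matching outage or start one."""
    pairs = site_pairs(routes[system_id])
    for outage in outages:
        shared = outage.common_pairs & pairs
        if shared:
            outage.common_pairs = shared       # can only stay equal or shrink
            outage.systems.add(system_id)
            return outage
    outage = SystemOutage(common_pairs=pairs, systems={system_id})
    outages.append(outage)
    return outage


def unaffected_systems(outage, routes):
    """Steps 252-253: systems on the common route not associated to the outage."""
    return {sid for sid, route in routes.items()
            if sid not in outage.systems
            and outage.common_pairs <= site_pairs(route)}


routes = {"S1": ["A", "B", "C", "D"], "S2": ["X", "B", "C", "Y"],
          "S3": ["Q", "C", "B", "R"], "S4": ["M", "N"]}
outages = []
record_system_fault(outages, "S1", routes)
outage = record_system_fault(outages, "S2", routes)
print(sorted(unaffected_systems(outage, routes)))  # ['S3']
```

In this example systems S1 and S2 share only the B-C span, so the outage is narrowed to that span, and S3 is listed as a possible restoration path because it crosses the same span without having reported a fault.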

Returning now to FIG. 2A: When an alarm condition clears, a message similar to an alarm will be received. Like an alarm, these messages will indicate the specific equipment and the original fault condition that is now clear. Step 204, Process Cleared Alarm, is fairly straightforward:

1. The data structure representing that alarm can be deleted.

2. If the alarm was reported to the users as a significant event, then an alarm clearance report needs to be made.

3. If the alarm was counted as a circuit alarm on one or more trunks, then those counters need to be decremented.

4. If such a trunk in step 3 had an inferred outage as a result of circuit alarms, then it needs to be re-evaluated to see if any outage can still be inferred. If not, then that inferred alarm needs to be cleared. (Since such an inferred trunk alarm was treated like a circuit alarm at the next-higher multiplex level, the cleared alarm must also be treated in a similar manner.)

5. If the circuit alarm counter for a given trunk in step 3 reaches zero and there have been no directly reported faults on that trunk, then the trunk data structure can be deleted.
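The five clearing steps can be sketched over plain dictionaries. The record layout, the key names, and the two callbacks (`reevaluate`, which reports whether an outage can still be inferred on a trunk, and `report_clear`, which notifies the users) are assumptions made for this illustration only.

```python
def process_cleared_alarm(alarm_id, alarms, trunks, reevaluate, report_clear):
    """Apply steps 1-5 to a clearance message; return the trunk ids deleted."""
    alarm = alarms.pop(alarm_id)                 # 1. delete the alarm record
    if alarm["reported"]:
        report_clear(alarm_id)                   # 2. issue a clearance report
    deleted = []
    for trunk_id in alarm["counted_on"]:
        trunk = trunks[trunk_id]
        trunk["alarm_count"] -= 1                # 3. decrement the trunk counter
        if trunk["inferred_outage"] and not reevaluate(trunk_id):
            trunk["inferred_outage"] = False     # 4. inference no longer holds
            report_clear("inferred:" + trunk_id)
        if trunk["alarm_count"] == 0 and not trunk["direct_fault"]:
            del trunks[trunk_id]                 # 5. nothing left to track
            deleted.append(trunk_id)
    return deleted


alarms = {"a1": {"reported": True, "counted_on": ["405"]}}
trunks = {"405": {"alarm_count": 1, "inferred_outage": True, "direct_fault": False}}
cleared = []
gone = process_cleared_alarm("a1", alarms, trunks,
                             reevaluate=lambda t: trunks[t]["alarm_count"] >= 2,
                             report_clear=cleared.append)
print(gone, cleared)  # ['405'] ['a1', 'inferred:405']
```

The cleared inferred alarm is passed to the same `report_clear` callback here for brevity; in the process described above it would instead be fed back as a cleared circuit alarm at the next-higher multiplex level.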

This completes the description of the major processing steps implemented by the invention.

Because this analysis process includes provisions for using fault alarm data to confirm a traffic-affecting outage, and since the analysis has already correlated equipment alarms to network circuits, the output of this analysis is readily usable to assess the impact of confirmed outages to dedicated customer circuits. An auxiliary process can be established, driven by the confirmed outages of this process, which uses a customer circuit database to identify the affected customers. Critical customers can be pro-actively notified of the outage. All affected customer circuits can be logged in a database in case later trouble reports need to be investigated. If the time of a customer-reported problem corresponds to a logged entry, then the problem can positively be correlated to the outage.

It should be understood that the invention is not limited to the exact details of construction shown and described herein for obvious modifications will occur to persons skilled in the art.

We claim:
 1. A fault detection method for a telephone network having multiplexer and transmission equipments, comprising the steps: sensing faults occurring in particular components of the network and generating fault alarm data therefrom; propagating the alarm data downstream through the network for collection at an end point; positioning a database at the end point: loading entries in the database that characterize the topology of the network and contain entries relating to routing of circuits and trunks through the network, and more particularly define (a) which trunk or ordered sequence of trunks contain a given circuit; (b) which circuits are contained within a given trunk; and (c) the topological route through the network for any given circuit or trunk; subjecting the database to the collected alarm data; correlating the collected alarm data with the database for producing information regarding (d) significant fault alarm events distinguished from sympathetic events, to determine the topographic point of failure; (e) inference of an outage on a trunk, where no fault alarms are directly received from trunk equipment, but where correlated alarms on multiple circuits contained within the same trunk are detected; and (f) confirmation of an outage, where direct alarms have been reported on a trunk; wherein the step of correlating the collected alarm data with the database further includes the steps of resetting a previously set circuit alarm counter to zero for all upstream trunks extracted from the topology database; determining whether a detected circuit alarm is the first to be counted on an upstream trunk; in the event it is the first to be counted, storing an alarm time stamp and a set of all upstream trunks for the circuit for which an alarm is detected; in the event it is not the first to be counted, determining whether the alarm has been received within a pre-selected time window relative to a previous count of a corresponding circuit alarm counter; in the event that it is within the window, incrementing the circuit alarm counter and determining a common path set; in the event that it is not within the window, determining whether the fault has been explicitly reported or inferred for the affected trunk; in the event that it has been explicitly reported, ignoring the alarm as spurious; in the event that it has not been explicitly reported, resetting the circuit alarm counter to 1 and storing as new alarm data (a) a reset circuit alarm time; and (b) a list of upstream trunks.
 2. The process set forth in claim 1 further comprising the step of suppressing storage of an alarm that lacks information concerning the location of a fault along a circuit or trunk.
 3. The process set forth in claim 1 further comprising the step of inferring an outage on a trunk, in the absence of a direct alarm on that trunk, if a majority of the monitored circuits contained in that trunk have reported alarms.
 4. The process set forth in claim 3 further comprising the step of comparing, during correlation, the routes of failed circuits in inferred locations for determining the trunk or minimal set of trunks along which an inferred or directly reported outage would satisfy all known circuit alarms.
 5. The process set forth in claim 1 further comprising the step of displaying the most significant alarms directly reported by network equipments or inferred from multiple alarms, wherein sympathetic alarms, caused by the significant alarms propagated downstream, are excluded.
 6. The process set forth in claim 1 further comprising the step of consolidating all directly and inferred outages, and determining, from the database, whether alternate routes are available for bypassing the outages.